Abstract

Retrocopies of protein-coding genes, copies of mature RNA that were reverse transcribed and inserted into the
genome, have commonly been categorized as pseudogenes with no biological importance. However, recent studies
have shown that they play an important role in genome evolution and in shaping interspecies differences. Here,
we present RetrogeneDB, a database of retrocopies in 62 animal genomes. RetrogeneDB contains information about
retrocopies, their genomic localization, parental genes, ORF conservation, and expression. To the best of our
knowledge, this is the most complete retrocopy database, providing information for dozens of species never
previously analyzed in the context of protein-coding gene retroposition. The database is available at
http://retrogenedb.amu.edu.pl.

Retrogenes, long considered unimportant copies of parental genes,
are nowadays called "seeds of evolution" because they have made a
significant contribution to genome evolution (Brosius 1991). It has
been shown that they play a very important role in the
diversification of transcriptomes and proteomes and may be
responsible for the wealth of species-specific features (Betran et
al. 2002; Balasubramanian et al. 2009; Szczesniak et al. 2011). As
duplicates of their parental genes, they evolve relatively fast and
may therefore acquire novel functions. Retrocopies of protein-coding
genes are also known to be involved in many diseases (Prendergast
2001; Ciomborowska et al. 2013).

Analyses of retroduplications have been mostly limited to a few
mammalian model species (mainly human and mouse) and fruit fly
(Kaessmann et al. 2009). Nonmammalian vertebrates have been largely
overlooked in retrocopy studies, and our knowledge of retrocopy
evolution in other animals is even more limited. Although retrocopies
are annotated in major genomic databases (Ensembl [Flicek et al.
2014], UCSC Genome Browser [Meyer et al. 2013], National Center for
Biotechnology Information Gene [Maglott et al. 2011]), they are often
annotated simply as "pseudogenes," in the same way as duplicates that
originated via DNA-based mechanisms. The same problem applies to the
more specialized database Pseudogene.org (www.pseudogene.org, last
accessed January 2014). The most complete retrocopy annotations are
in the Ensembl database; although they are very good for human and
mouse, their quality is very poor for the remaining genomes. There
are only two databases fully dedicated to retrocopies: RCPedia
(Navarro and Galante 2013) and HOPPSIGEN (Khelifi et al. 2005).
However, the former contains data only for a few primate species, and
the latter is limited to human and mouse.

We analyzed the genomes of 62 animal species to identify retrocopies.
The search was based on similarities between the reference genomic
sequence and the proteins encoded by multiexon genes in a given
species. To increase accuracy, we applied several criteria to call a
genomic region a retrocopy: an alignment length of at least 150 bp, a
minimum of 50% coverage of the parental gene, a minimum of 50%
identity, and the loss of at least two introns, among others (for
details see supplementary file S1, Supplementary Material online).
The resulting data set was additionally inspected manually to exclude
potential false positives, especially copies of transposons annotated
as protein-coding genes, which in some genomes numbered as many as a
few thousand. Our strategy led to the identification of 84,808
retrocopies, including 6,277 protein-coding genes not previously
recognized as retrogenes. A total of 64,225 of the retrocopies we
identified are not present in the Ensembl database; these include 139
retrocopies in the human genome and as many as 2,205 in the mouse
genome, which are among the best annotated. Because of our stringent
requirements, applied in order to generate a high-quality data set,
the number of retrocopies identified in a given species is
considerably lower than in most other databases. However, this method
gave consistently good results in both well-annotated genomes and
poorly annotated, low-coverage genomes, for example, alpaca or
dolphin.
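
The thresholds listed above can be illustrated with a minimal Python sketch. It applies only the four criteria named in the text (alignment length, coverage, identity, and intron loss). The `Candidate` record, its field names, and the `passes_criteria` helper are hypothetical and serve illustration only; the additional criteria described in supplementary file S1 and the manual inspection step are not reproduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the text; all names below are illustrative assumptions.
MIN_ALIGNMENT_BP = 150
MIN_COVERAGE = 0.50      # fraction of the parental gene covered by the alignment
MIN_IDENTITY = 0.50      # fraction of identical aligned positions
MIN_INTRONS_LOST = 2


@dataclass
class Candidate:
    """Hypothetical summary of one alignment between a protein and the genome."""
    name: str
    alignment_bp: int
    coverage: float
    identity: float
    introns_lost: int


def passes_criteria(c: Candidate) -> bool:
    """Return True if the hit satisfies the four thresholds listed in the text."""
    return (
        c.alignment_bp >= MIN_ALIGNMENT_BP
        and c.coverage >= MIN_COVERAGE
        and c.identity >= MIN_IDENTITY
        and c.introns_lost >= MIN_INTRONS_LOST
    )


# Made-up example hits, not data from the database.
candidates = [
    Candidate("hit_a", 900, 1.00, 0.89, 5),
    Candidate("hit_b", 120, 0.80, 0.95, 3),  # alignment too short
    Candidate("hit_c", 600, 0.70, 0.75, 1),  # only one intron lost
]
kept = [c.name for c in candidates if passes_criteria(c)]
print(kept)
```

Requiring the loss of at least two introns is what separates retrocopies from DNA-based duplicates in this sketch, since only copies derived from spliced mRNA lack the introns of the parental gene.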

The number of retrocopies differs significantly even between closely
related species, for example, 4,927 in human vs. 3,285 in chimpanzee.
This may result from differences in annotation and from
species-specific retroposition events. In addition, retrocopies are
polymorphic, and the higher number of retrocopies in human (vs.
chimpanzee) may reflect the large amount of human population data
(Abyzov et al. 2013).

RetrogeneDB ID: retro_hsap_104
Organism: Human (Homo_sapiens)
Location: 2:120979499-120980552 (-)
Status: KNOWN_PROTEIN_CODING
Ensembl ID: ENSG00000226479
Aliases: No gene alias available
Located in intron of: None

Alignment summary
Identity: 88.57 %
Coverage: 100.0 %
Frameshifts: 0
Stop codons: 0

Parental gene: ENSG00000155934
Parental gene symbol: TMEM185A
Parental gene aliases: TMEM185A, CXorf13, FAM11A, FRAXF, ee3
Parental gene description: transmembrane protein 185A [Source:HGNC Symbol;Acc:17125]

Genomic region (view in browser)

Fig. 1. Example of RetrogeneDB record with selected data.

Retrocopies, as a second copy of an existing gene, evolve relatively
quickly and accumulate mutations. However, many of them gain
functionality and become subject to purifying selection
(Vinckenbosch et al. 2006; Yu et al. 2007). We compared retrocopies
with their progenitors to single out those with a conserved ORF, that
is, without internal stop codons or frameshifts over the entire
alignment. In mammals, retrocopies with conserved ORFs account for
10-25% of all retrocopies. In nonmammalian animals, the fraction is
much higher, considerably over 50% and in some species close to 100%.
However, conservation of the ORF over the length of the alignment
does not automatically imply that a retrocopy is efficiently
translated, even if it is expressed. In selected species, we also
identified expressed retrocopies based on RNA-seq data. Because of
the high similarity of retrocopies to their parental genes, during
read mapping we required that reads map uniquely and perfectly to
retrocopies (supplementary file S1, Supplementary Material online).
This led to an underestimation of retrocopy expression levels but
prevented false-positive predictions of expressed retrocopies.
Approximately 10-20% of mammalian retrocopies are expressed in at
least one library at a minimal level of 1 RPM (reads per million
mapped). In lizard, this number is higher, with almost 40% of
retrocopies expressed. The majority of expressed retrocopies in
marsupials, egg-laying mammals, and nonmammalian species have
conserved ORFs. However, in placental mammals, the fraction of
expressed retrocopies with a conserved ORF is lower, ranging from
only 30% in human up to 65% in horse.
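
The two classifications used above, ORF conservation and expression at 1 RPM in at least one library, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline from supplementary file S1: it assumes a gapped nucleotide alignment between the parental coding sequence and the retrocopy, treats any gap run whose length is not a multiple of three as a frameshift, and assumes that read counts already include only uniquely and perfectly mapped reads. All function names and example sequences are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_frameshifts(parent_aln: str, retro_aln: str) -> int:
    """Count gap runs (in either aligned sequence) whose length is not a multiple of 3."""
    return sum(
        1
        for seq in (parent_aln, retro_aln)
        for run in re.findall(r"-+", seq)
        if len(run) % 3 != 0
    )


def count_internal_stops(retro_aln: str) -> int:
    """Count in-frame stop codons in the ungapped retrocopy, ignoring the final codon."""
    seq = retro_aln.replace("-", "").upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons[:-1])


def has_conserved_orf(parent_aln: str, retro_aln: str) -> bool:
    """Conserved ORF: no frameshifts and no internal stop codons over the alignment."""
    return (
        count_frameshifts(parent_aln, retro_aln) == 0
        and count_internal_stops(retro_aln) == 0
    )


def rpm(read_count: int, total_mapped_reads: int) -> float:
    """Reads per million mapped reads."""
    return read_count * 1_000_000 / total_mapped_reads


def is_expressed(counts, mapped_totals, min_rpm: float = 1.0) -> bool:
    """Expressed if at least one library reaches the RPM threshold."""
    return any(rpm(c, m) >= min_rpm for c, m in zip(counts, mapped_totals))


# Made-up toy sequences: a parental CDS and three retrocopy variants.
parent = "ATGGCTGAATTTTGGTAA"
retro_ok = "ATGGCTGAGTTTTGGTAA"      # one substitution, ORF intact
retro_stop = "ATGGCTTAATTTTGGTAA"    # premature in-frame TAA
retro_shift = "ATGGCTGA-TTTTGGTAA"   # single-base deletion

for label, retro in [("ok", retro_ok), ("stop", retro_stop), ("shift", retro_shift)]:
    print(label, has_conserved_orf(parent, retro))

# One retrocopy measured in two libraries: 0 and 12 reads out of 20 and 8 million.
print(is_expressed([0, 12], [20_000_000, 8_000_000]))
```

Counting only uniquely and perfectly mapped reads keeps reads from the parental gene out of `read_count`, which is why the resulting RPM values are conservative estimates.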

All data are stored in a MySQL database (www.mysql.com, last accessed
September 2013), and the web interface was developed using the Django
framework (www.djangoproject.com, last accessed January 2014). The
database is available at http://retrogenedb.amu.edu.pl (last accessed
April 26, 2014) and can be searched from either the retrocopy or the
parental gene perspective. The retrocopy search can be based on
genomic localization, keywords, parental gene name, or retrocopy ID,
and results can be filtered by retrocopy type, ORF conservation, or
expression. In addition, a JBrowse genome browser was implemented,
allowing inspection of a retrocopy in its genomic context (fig. 1).
The search from the parental gene perspective makes it possible to
identify all retrocopies of a given gene, or all of its orthologs
that were retroposed in any other species. Users can also perform
sequence-based searches using the BLAST tool.
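
As an illustration of a search by genomic localization, the following Python sketch parses a location string in the form shown in the example record (chromosome, start, end, and strand) and tests whether it overlaps a query interval. It does not use or reflect the actual RetrogeneDB code or schema; the `Region` type, the `parse_location` and `overlaps` helpers, and the assumption that coordinates are inclusive are all hypothetical.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for one genomic location."""
    chrom: str
    start: int
    end: int
    strand: str


# Matches strings such as "2:120979499-120980552 (-)".
_LOCATION = re.compile(r"^\s*(\S+?):(\d+)-(\d+)\s*\(([+-])\)\s*$")


def parse_location(text: str) -> Region:
    """Parse 'chrom:start-end (strand)' into a Region."""
    match = _LOCATION.match(text)
    if match is None:
        raise ValueError(f"unrecognized location: {text!r}")
    chrom, start, end, strand = match.groups()
    return Region(chrom, int(start), int(end), strand)


def overlaps(region: Region, chrom: str, start: int, end: int) -> bool:
    """True if the region shares at least one base with chrom:start-end (inclusive)."""
    return region.chrom == chrom and region.start <= end and start <= region.end


record = parse_location("2:120979499-120980552 (-)")
print(record)
print(overlaps(record, "2", 120_980_000, 120_990_000))
print(overlaps(record, "X", 120_980_000, 120_990_000))
```

In a relational database, the same overlap condition is typically expressed as a filter on indexed chromosome, start, and end columns.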

Supplementary Material 

Supplementary file S1 is available at Molecular Biology and 
Evolution online (http://www.mbe.oxfordjournals.org/). 